A constrained hierarchical rule extraction method based on phrase collocations and high-frequency backbone words

Authors

  • Jinsong Su
  • Yajuan Lv
  • Qun Liu
Abstract

The hierarchical phrase-based machine translation model is a popular translation model that combines the advantages of phrase-based and syntax-based translation models. However, because current hierarchical phrase extraction applies no linguistic constraints, it extracts a large number of redundant generalized rules. In this paper, we propose two strategies to constrain the extraction of hierarchical rules and reduce the number of redundant rules. First, we identify phrase collocations with the log-likelihood ratio and require that each phrase collocation be packed as a whole during extraction. Second, we identify backbone words by their frequency and, during extraction, forbid sub-phrases that consist only of backbone words from being replaced with variables. Experimental results show that our methods substantially reduce the number of generalized rules without a significant decrease in BLEU score.
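The two constraints described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how counts are collected, which threshold is used, or how a rule extractor consults the constraints. The sketch assumes Dunning's log-likelihood ratio over a 2x2 contingency table of co-occurrence counts, and assumes that backbone words are those whose corpus frequency reaches a hypothetical `min_count` threshold. All function names are illustrative.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table of counts.

    k11: both items occur together, k12: first without second,
    k21: second without first,     k22: neither occurs.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def backbone_words(tokens, min_count):
    """Assumed definition: words seen at least min_count times."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}


def may_become_variable(sub_phrase, backbone):
    """A sub-phrase made only of backbone words must stay lexicalized."""
    return not all(w in backbone for w in sub_phrase)
```

A high score from `log_likelihood_ratio` suggests that a pair co-occurs more often than chance would predict, so an extractor following the first strategy would keep the pair together. A check like `may_become_variable` would be applied before a sub-phrase is replaced with a variable, which is how the second strategy limits generalization.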

Similar resources

TCtract-A Collocation Extraction Approach for Noun Phrases Using Shallow Parsing Rules and Statistic Models

This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all the noun phrase collocations from a shallow parsed corpus by using syntactic knowledge in the form of phrase rules. It then removes pseudo collocations by using a set of statistic-based association measures (...


A Hybrid Extraction Model for Chinese Noun/Verb Synonym bi-gram Collocations

Statistical-based collocation extraction approaches suffer from (1) low precision rate because high co-occurrence bi-grams may be syntactically unrelated and are thus not true collocations; (2) low recall rate because some true collocations with low occurrences cannot be identified successfully by statistical-based models. To integrate both syntactic rules as well as semantic knowledge into a s...


Left-to-Right Hierarchical Phrase-based Machine Translation

Hierarchical phrase-based translation (Hiero for short) models statistical machine translation (SMT) using a lexicalized synchronous context-free grammar (SCFG) extracted from word aligned bitexts. The standard decoding algorithm for Hiero uses a CKY-style dynamic programming algorithm with time complexity O(n^3) for source input with n words. Scoring target language strings using a language mod...


Next or Beyond Next: Effect of Contrastive Phrase-Based Treatment on Stage Gain Across Self-Paced and More Time-Constrained Tasks

This study explored the effect of contrastive phrase resynthesis instruction on gaining the teachability hypothesis stages in self-paced versus time-constrained oral production and recognition. Three groups (i.e., 23 learners) of high beginner female learners in an English language institute were randomly selected from a cohort of learners. One group received contrastive metalinguistic instruction ...


Improving Collocation Extraction for High Frequency Words

The purpose of this paper is to introduce an alternative word association measure aimed at addressing the under-extraction of collocations that contain high frequency words. While measures such as MI provide the important contribution of filtering out sheer high frequency of words in the detection of collocations in large corpora, one side effect of this filtering is that it becomes correspondingl...


Publication date: 2009